GBE 



Evolutionary Rate and Duplicability in the Arabidopsis 
thaliana Protein-Protein Interaction Network 

David Alvarez-Ponce 1,2 '* and Mario A. Fares 1 ' 2 '* 

department of Abiotic Stress, Integrative and Systems Biology Laboratory, Instituto de Biologia Molecular y Celular de Plantas, Consejo 
Superior de Investigaciones Cientificias (CSIC-UPV), Valencia, Spain 

department of Genetics, University of Dublin, Trinity College, Dublin, Ireland 

*Corresponding author: E-mail: mfares@ibmcp.upv.es; david.alvarez@nuim.ie, david.alvarez.ponce@gmail.com. 
Accepted: November 9, 2012 

Abstract 

Genes show a bewildering variation in their patterns of molecular evolution, as a result of the action of different levels and types of 
selective forces. The factors underlying this variation are, however, still poorly understood. In the last decade, the position of proteins in 
the protein-protein interaction network has been put forward as a determinant factor of the evolutionary rate and duplicability of 
their encoding genes. This conclusion, however, has been based on the analysis of the limited number of microbes and animals for 
which interactome-level data are available (essentially, Escherichia coli, yeast, worm, fly, and humans). Here, we study, for the first 
time, the relationship between the position of proteins in the high-density interactome of a plant (Arabidopsis thaliana) and the 
patterns of molecular evolution of their encoding genes. We found that genes whose encoded products act at the center of the 
network are more evolutionarily constrained than those acting at the network periphery. This trend remains significant when 
potential confounding factors (gene expression level and breadth, duplicability, function, and length of the encoded products) are 
controlled for. Even though the correlation between centrality measures and rates of evolution is generally weak, for some functional 
categories, it is comparable in strength to (or even stronger than) the correlation between evolutionary rates and expression levels or 
breadths. In addition, genes encoding interacting proteins in the network evolve at relatively similar rates. Finally, Arabidopsis proteins 
encoded by duplicated genes are more highly connected than those encoded by singleton genes. This observation is in agreement 
with the patterns observed in humans, but in contrast with those observed in £ coli, yeast, worm, and fly (whose duplicated genes 
tend to act at the periphery of the network), implying that the relationship between duplicability and centrality inverted at least twice 
during eukaryote evolution. Taken together, these results indicate that the structure of the A. thaliana network constrains the 
evolution of its components at multiple levels. 

Key words: network evolution, Arabidopsis interactome, natural selection, rates of evolution, gene duplication, network 
centrality. 



Introduction 

Genes and proteins rarely operate in isolation; on the contrary, 
they often act as parts of complex functional networks of 
interacting molecules. Understanding the function and evolu- 
tion of molecular networks is not only a key step toward 
understanding organisms' function and evolution, but it can 
also aid applications such as metabolic engineering and drug 
discovery and design (for review, see Butcher et al. 2004; 
Korcsmaros et al. 2007; Lee et al. 2011). Furthermore, con- 
sidering the position of genes and proteins in the networks 
in which they participate may provide key insight into the 
evolutionary forces governing their evolution. Indeed, several 



lines of evidence point to a link between the position of pro- 
teins in metabolic and protein-protein interaction networks 
(PINs) and the patterns of molecular evolution of their encod- 
ing genes (reviewed by Cork and Purugganan 2004; Eanes 
201 1; Wagner 201 2). In particular, proteins' network position 
has an effect on their rate of evolution and on the duplicability 
of their encoding genes. Several questions, however, remain 
open. 

Proteins' rates of evolution vary across orders of magni- 
tude, as a result of the action of different levels and types of 
evolutionary forces (Zuckerkandl and Pauling 1965; King and 
Jukes 1969; Ohtaand Kimura 1971; Lietal. 1985). Identifying 
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and understanding the factors responsible for this variability is 
one of the main open questions in Evolutionary Biology. 
Purifying selection is expected to act more severely on genes 
performing functions that are more important for the organ- 
ism's biological fitness, thus resulting in lower rates of evolu- 
tion (Wilson et al. 1977; Kimura 1983); however, the relative 
contribution of genes to fitness remains elusive. Over the last 
decade, the wealth of genomic and functional data has made 
feasible to pursue the formulation of a unifying theory that 
explains the variation of the rates of protein evolution (for 
review, see Herbeck and Wall 2005; Koonin and Wolf 2006; 
Mclnerney 2006; Pal et al. 2006; Rocha 2006; Wolf et al. 
2006). Several factors have been shown to correlate with 
the strength of purifying selection acting on genes, with pat- 
terns of expression being the most prominent: more highly or 
widely expressed genes tend to be more constrained than 
those expressed at low levels or in a narrow range of tissues 
(Duret and Mouchiroud 2000; Pal et al. 2001; Krylov et al. 
2003; Subramanian and Kumar 2004; Wright et al. 2004; 
Drummond et al. 2005, 2006; Ingvarsson 2007; Slotte et al. 
201 1; Yang and Gaut 201 1). Other relevant determinants of 
genes' evolutionary rates include the length (Subramanian 
and Kumar 2004; Ingvarsson 2007) and function 
(Castillo-Davis et al. 2004; Alvarez-Ponce and Mclnerney 
201 1) of the encoded products. However, these factors are 
poor predictors of evolutionary rates (e.g., Ingvarsson 2007; 
Larracuente et al. 2008). 

Remarkably, genes acting at the center of the PIN tend to 
be more selectively constrained than those acting at the net- 
work periphery, a pattern that has thus far been observed in 
Escherichia coli, yeasts, worms, flies, and mammals (Fraser 
et al. 2002; Jordan et al. 2003; Hahn and Kern 2005; Lemos 
et al. 2005; Davids and Zhang 2008; Alvarez-Ponce 2012). 
Consistently, proteins involved in complexes tend to be 
highly conserved (Teichmann 2002). Similar patterns have 
been observed in metabolic networks, whose most connected 
enzymes tend to evolve under stronger selective pressures 
(Vitkup et al. 2006). In addition, genes encoding physically 
interacting proteins tend to evolve at relatively similar rates 
(Fraser et al. 2002; Lemos et al. 2005; Alvarez-Ponce et al. 
2009, 201 1; Cui et al. 2009). Although these trends are sig- 
nificant and consistent across all organisms studied to date, 
they are often weak, and whether they reflect a direct effect 
of network position on genes' rates of evolution has been the 
subject of debate. In particular, some authors have suggested 
that these trends might be a byproduct of the distribution of 
confounding factors across the network, such as genes acting 
at the center of the network being expressed at higher levels 
or to biases in interactomic data sets (Bloom and Adami 2003; 
Batada et al. 2006) (but see Fraser et al. 2003; Fraser and Hirsh 
2004; Fraser 2005; Lemos et al. 2005). 

Genes also widely differ in their duplicability (i.e., the pro- 
pensity to retain duplicated copies after a gene duplication 
event), with some genes remaining as single copies 



(singletons) over long evolutionary periods and others being 
recurrently duplicated. Genes' duplicabilities are known to be 
affected by a number of factors, including the position of the 
encoded proteins in the PIN. However, the direction of the 
relationship between gene duplicability and network centrality 
is not universal. In the PINs of E. coii, yeast, worm, and fly, 
singleton genes tend to occupy more central positions than 
duplicated genes (Hughes and Friedman 2005; Prachumwat 
and Li 2006; Makino et al. 2009; D'Antonio and Ciccarelli 
201 1). This trend has been attributed to a fragility of the net- 
work to duplications affecting its more connected elements. 
Indeed, complexes and pathways are thought to perform 
better with balanced concentrations of their members, and 
the duplication of a given gene is expected to disrupt the 
dosage balance of the interactions in which its encoded prod- 
uct is involved (Birchler et al. 2001; Veitia 2002; Papp et al. 
2003), which may have more deleterious effects for proteins 
with a higher number of interactors. Conversely, human pro- 
teins encoded by duplicated genes tend to be more central to 
the PIN than those encoded by singleton genes (Liang and Li 
2007; Doherty et al. 201 2). This is a derived character resulting 
from the high duplicability of human hubs originated after the 
emergence of Metazoans, implying that the relationship be- 
tween centrality and duplicability underwent modification in 
the vertebrate lineage (D'Antonio and Ciccarelli 2011). The 
factors underlying this shift in the relationship between cen- 
trality and duplicability remain, however, unclear. 

The relationship between network position and genes' pat- 
terns of molecular evolution has been hitherto investigated 
only in the few microorganisms (£ coli and yeast) and animals 
(worm, fly, and human) for which interactome-scale protein- 
protein interaction data are available. The recent availability of 
a relatively high-density interactome for Arabidopsis thaliana 
(Arabidopsis Interactome Mapping Consortium 2011; Stark 
et al. 201 1), together with the availability of the genome se- 
quences for this species (Arabidopsis Genome Initiative 2000) 
and its close relative A. lyrata (Hu et al. 201 1), allowed us to 
investigate, for the first time, the relationship between genes' 
patterns of molecular evolution and their position in a plant 
network. Consistent with patterns observed in other organ- 
isms, we found that genes encoding the most central proteins 
of the network tend to evolve under stronger levels of select- 
ive constraint and that genes encoding physically interacting 
proteins evolve at relatively similar rates, pointing to general 
trends across all of life. The relationship between centrality 
and evolutionary rate is independent of potential confounding 
factors (gene duplicability, expression level and breadth, and 
the length of the encoded products), suggesting a direct effect 
of the network structure on the patterns of molecular evolu- 
tion of its components. Even though the correlation between 
the measures of centrality and rates of evolution is in general 
weak, for some functional categories, it is comparable in 
strength to (or even stronger than) the correlation between 
evolutionary rates and expression levels or breadths. 
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Surprisingly, genes acting at the center of the A. thaliana PIN 
are more likely to be duplicated than those acting at the per- 
iphery, implying that the relationship between centrality and 
duplicability underwent modification not only in vertebrates 
but also in plants. 

Materials and Methods 

Protein-Protein Interaction Data 

The PIN was obtained by merging the A. thaliana networks 
available from the BioGRID database v3. 1 .81 (Stark et al. 201 1 ) 
and from Arabidopsis Interactome Mapping Consortium 
(2011). Only physical interactions among pairs of A thaliana 
proteins were considered. All interactions used in the current 
analysis had been determined experimentally in A thaliana 
(i.e., the data set does not contain interactions inferred com- 
putationally or derived from other species). 

Impact of Natural Selection 

For each gene in the network, we attempted to identify its 1 :1 
ortholog in the A lyrata genome (Hu et al. 201 1) using a best 
reciprocal Basic Local Alignment Search Tool (BLAST) ap- 
proach (using BLASTP and an f-value cut-off of 10~ 10 ). Each 
pair of A thaliana-A. lyrata orthologous protein sequences 
was aligned using ProbCons 1.12 (Do et al. 2005), and the 
resulting alignments were used to guide the alignment of the 
corresponding coding sequences. Estimates of c/ N , d s and co 
were obtained using the one-ratio model M0 from the PAML 
4.4 package (Yang 2007). 

Paralogs Identification 

Each A thaliana protein was used as query in a BLASTP 
(Altschul et al. 1 997) search against the A thaliana proteome. 
Genes were classified as singleton if no hit was obtained with 
an f value < 0.1 or as duplicated if at least a homolog was 
found that met the following criteria: 1) f value < 10~ 10 ; 2) 
the aligned region length (L) was >80% of the length of the 
query sequence; and 3) amino acid identity was >30% if 
L > 1 50 amino acids or 0.06 + 4.8/T 0 3211 + e *P<-^. 000 >] other- 
wise (Rost 1999). Duplicated genes were further classified as 
whole-genome duplication (WGD) genes if classified as such 
in Blanc et al. (2003); otherwise, they were classified as result- 
ing from small-scale genome duplication (SSD). 

Gene Expression Level and Breadth 

Expression data from 79 A thaliana tissues were obtained 
from Schmid et al. (2005). For each gene and tissue, values 
were averaged across the three replicates, and the gene was 
considered to be expressed if it was annotated as "present" in 
at least two of the replicates. For each gene, expression level 
was computed as the median across the 79 tissues, and ex- 
pression breadth was computed as the number of tissues in 
which the gene is expressed. For genes matching multiple 



probe sets, the set yielding a higher expression level was 
used. Probe sets matching multiple genes were discarded. 

Functional Information 

Each A thaliana gene was assigned to one (or sometimes a 
few) eukaryotic orthologous group (KOG) categories using the 
eggNOG v2 database (Muller et al. 2010). 

Ribosomal Proteins 

Proteins were considered to be ribosomal if identified as such 
in Barakat et al. (2001), if their description contained the text 
"ribosomal protein" or if they were assigned to the Gene 
Ontology (Ashburner et al. 2000) terms "large ribosomal sub- 
unit" or "small ribosomal subunit." 

Age of Genes 

Each of the 5,789 A thaliana network proteins was used as 
query in a BLASTP (Altschul et al. 1997) search against the nr 
database (obtained from the National Center for 
Biotechnology Information on January 2012). An f-value 
cut-off of 10~ 6 was used. Only hits whose aligned region 
length was >80% the length of the query sequence were 
considered. Genes presenting hits in prokaryotes or in non- 
plant eukaryotes (non-Embryophyta) were deemed as "an- 
cient." The remaining genes were considered to be "land 
plant-specific." 

Results 

Genes Acting at the Center of the A thaliana PIN Are 
More Selectively Constrained Than Those Acting at the 
Network Periphery 

We assembled a PIN for A thaliana by merging all interactions 
available in the BioGRID database (Stark et al. 201 1) and in 
Arabidopsis Interactome Mapping Consortium (201 1). The re- 
sulting network consisted of 5,789 unique proteins connected 
by 14,368 physical binary interactions. For each protein in the 
network, we computed three centrality measures: 1) the 
number of interactors (degree); 2) the number of shortest 
paths between all pairs of proteins of which the protein is 
part (betweenness; Freeman 1977); and 3) the inverse of 
the average shortest distance to all the other proteins in the 
network (closeness). Using a best reciprocal BLAST approach, 
we found 1:1 A lyrata orthologs for 3,868 of the 5,789 
A thaliana network genes. For each pair of orthologs, the 
impact of natural selection was inferred from the nonsynon- 
ymous to synonymous divergence ratio (co = c/n/c/ s ). Values of 
cd = 1 are expected for genes evolving neutrally, whereas 
co < 1 is indicative of the action of purifying selection preser- 
ving the sequence of the encoded proteins, and co > 1 in a 
number of codons is indicative of the action of positive selec- 
tion (adaptive evolution) driving the fixation of nonsynon- 
ymous substitutions. The estimated co values exhibit a 
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Fig. 1. — Correlation between to and betweenness. 



Table 1 

Spearman's Correlations among the Parameters Considered in the Study 

m d N d s Degree Closeness Betweenness Expression Level Expression Breadth 





0.732*** 












ds 


-0.219*** 


0.310*** 










Degree 


-0.030 


-0.026 


-0.019 








Closeness 


-0.034* 


-0.032* 


0.001 


0.412*** 






Betweenness 


-0.053** 


-0.051** 


-0.017 


0.875*** 


0.436*** 




Expression level 


-0.331*** 


-0.342*** 


-0.049** 


-0.013 


0.066** 


0.046** 


Expression breadth 


-0.245*** 


-0.264*** 


-0.072** 


-0.006 


0.024 


0.035* 0.692*** 


Protein length 


0.050** 


-0.009 


-0.124*** 


-0.017 


-0.042** 


-0.023 -0.088*** 



*P<0.05. 
**P<0.01. 
***P< 1(T 6 . 



median of 0.128, indicating that these genes evolved under 
relatively high levels of selective constraint. 

We found that the 3,868 genes encoding proteins that are 
represented in the network (i.e., those with described inter- 
actors) exhibit lower rates of evolution than those not repre- 
sented in the network (median co values of 0.128 and 0.183, 
respectively; Mann-Whitney test, P<10~ 15 ). In addition, 
among these 3,868 network genes, co values negatively 
correlate with betweenness (Spearman's rank correlation 
coefficient, p = -0.053, P= 0.001; fig. 1) and closeness 
(p = -0.034, P= 0.035): the larger the values of betweenness 
and closeness for a protein, the stronger are the selective 
constraints acting on that protein. Although the correlation 
is only marginally significant for degree (p = -0.030, 
P= 0.063), the 1,467 genes with degree >1 show signifi- 
cantly lower co values than the 2,401 genes with degree = 1 



(median co values of 0.125 and 0.135, respectively; Mann- 
Whitney test, P= 0.005). Taken together, these results 
indicate that the most central genes in the A. thaliana network 
are subject to stronger levels of purifying selection than those 
acting at the network periphery. Similar results were obtained 
for d N but not for d s (table 1), indicating that the distribution 
across the network of selection at the amino acid level is 
the main responsible for the observed trend. 

The Relationship between Centrality and Evolutionary 
Rate Is Independent of Gene Expression Level and 
Breadth, Gene Function, and the Length of the 
Encoded Products 

Having established an association between proteins' central- 
izes and their rates of evolution, we sought to determine 
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whether this association reflected a direct effect of net- 
work position or, on the contrary, it was the result of the 
distribution across the network of a number of factors 
that correlate both with rates of evolution and network 
centralities. 

Levels of selective constraint acting on a gene are known to 
depend on a number of factors, among which the level and 
breadth of gene expression seem to be the most important 
(Duret and Mouchiroud 2000; Pal et al. 2001; Krylov et al. 
2003; Subramanian and Kumar 2004; Wright et al. 2004; 
Drummond et al. 2005, 2006; Ingvarsson 2007; Slotte et al. 
2011; Yang and Gaut 2011). In agreement with previous 
reports, we observed that the co and d N values exhibit a 
strong negative correlation with both expression level and 
breadth in our data set (table 1). Additionally, we observed 
that, similar to the PINs of other organisms (Bloom and Adami 
2003; Lemos et al. 2005), the most central genes in the 
Arabidopsis PIN are more highly and broadly expressed 
than those acting at the periphery: expression levels positively 
correlate with betweenness and closeness and expression 
breadths with betweenness (table 1). Combined, these obser- 
vations raise the possibility that the correlation between 
centrality measures and evolutionary rates could be driven 
by the higher levels and/or breadths of expression of genes 
acting at the center of the PIN. However, partial correlation 
analysis (supplementary table S1, Supplementary Material 
online) shows that the association between co and between- 
ness is independent of expression level and breadth 
(p = -0.038, P= 0.026). 

Another potential confounding factor is protein length, 
as it correlates positively with co (i.e., genes encoding shorter 
proteins tend to be more selectively constrained; Subramanian 
and Kumar 2004; Ingvarsson 2007; table 1) and negatively 
with closeness (i.e., genes acting at the center of the network 
tend to encode shorter proteins, in agreement with previous 
observations in Drosophila; Lemos et al. 2005; table 1). 
However, the correlation between co and both between- 
ness (p = -0.052, P= 0.001) and closeness (p= -0.032, 
p= 0.048) remains significant when protein lengths are con- 
trolled for (supplementary table S1, Supplementary Material 
online), indicating that the tendency for central genes to 
evolve under strong levels of selective constraint is also inde- 
pendent of protein length. Furthermore, the correlation be- 
tween co and betweenness remains significant when protein 
length and expression level and breadth are simultaneously 
controlled for (p = — 0.037, P= 0.028; supplementary table 
S1, Supplementary Material online). 

Genes performing different functions are subject to differ- 
ent levels of selective constraint (e.g., Castillo-Davis et al. 2004; 
Alvarez-Ponce and Mclnerney 201 1 ) and encode proteins with 
different network centralities (Kunin et al. 2004; Cotton and 
Mclnerney 2010; Alvarez-Ponce and Mclnerney 2011). 
Consistently, we found that genes in different categories 
exhibit different co and centrality values (Kruskal— Wallis 



test, co: P<10~ 15 ; degree: P< 10~ 15 ; betweenness: 
P=1.17x10 -11 ; closeness: P= 7.32 x 10~ 10 ). Therefore, 
the observed correlation between centrality and evolutionary 
rate (table 1) could conceivably be the result of these differ- 
ences. To discard this possibility, the correlation between 
evolutionary rate and degree was evaluated separately for 
genes involved in each of the KOG functional categories 
(Tatusov et al. 2003) (similar results were obtained for 
betweenness and closeness; data not shown). The analysis 
was restricted to the 20 KOG categories comprising more 
than 25 A. thaliana network genes. Remarkably, the correl- 
ation coefficient was negative for 18 of the 20 categories 
(table 2). The correlation was significant for only five of 
these categories, probably as a result of the reduced statistical 
power resulting from partitioning the data set. These observa- 
tions indicate that, in spite of the fact that genes with different 
functions exhibit different co and centrality values, the negative 
correlation between centrality and oo is largely independent of 
gene function. 

Remarkably, correlation coefficients are highly variable 
across the different KOG categories (table 2). First, pairwise 
comparison shows that 10 pairs of categories exhibit signifi- 
cantly different correlation coefficients (supplementary 
table S2, Supplementary Material online). Second, for each 
functional category, we compared the correlation coefficient 
for genes within that category with the correlation coefficient 
for all other genes (i.e., all network genes not belonging 
to that category); this analysis shows that categories 
"signal transduction mechanisms" (P= 0.009) and "chroma- 
tin structure and dynamics" (P= 0.025) exhibit significantly 
higher correlation coefficients than the rest of the net- 
work (table 2). Taken together, these results indicate that 
the extent to which sequence evolution is affected by protein 
centrality is dependent on gene function. In fact, some cate- 
gories exhibit strong negative co-degree correlations that 
are comparable in strength to (or even stronger than) the 
co-expression level and the co-expression breadth correlations 
(tables 1 and 2). 

Finally, because genes encoding ribosomal proteins are 
highly conserved, and presumably highly connected, the as- 
sociation between centrality and evolutionary rate observed 
here could potentially be due to these genes. However, 
when these genes are eliminated, the correlation between 
co and betweenness remains significant (p = -0.062, 
P= 1 .34 x 10~ 4 ) and the correlation between co and 
degree — which was not significant in the complete data set 
(table 1)— reaches significance (p = -0.041, P= 0.011). 
Furthermore, these correlations remain significant when ex- 
pression level, expression breadth, and protein length are sim- 
ultaneously controlled for (degree: p = -0.038, P= 0.029; 
betweenness: p = -0.042, P= 0.014). 

Taken together, these observations point to a direct effect 
of the position of proteins in the A. thaliana PIN on the rates of 
evolution of their encoding genes. 
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Genes Encoding Physically Interacting Proteins Are 
Subject to Similar Levels of Selective Constraint 

We considered whether interacting proteins in the A thaliana 
network are subject to relatively similar levels of selective con- 
straint compared with random protein pairs. For that purpose, 
we computed the normalized absolute difference between 
the evolutionary rates of all pairs of interacting genes in the 
network (X): 

x = 1 y l<o;i -(Q/2 1 
~ mj-( ((0/1 + co,- 2 )/2 

Here, m is the number of interactions in the network, and co (1 
and <o, 2 are the co values of the two genes involved in inter- 
action /'. Self-interactions (i.e., interactions among proteins 
encoded by the same gene, a total of 383 in the data set) 
were not considered in this analysis, as PINs are enriched in 
such interactions, which can inflate the average similarity be- 
tween interacting genes (Ispolatov et al. 2005; Alvarez-Ponce 
and Mclnerney 2011). The observed value (X=0.892) was 
compared with a null distribution obtained from a collection 
of random networks with the same proteins, number of inter- 
actions, and degree for each node, which we generated from 
the original network by repeatedly switching pairs of edges 
(as in Luisi et al. 201 2). Out of 1 0,000 random networks, only 
426 showed an X value lower or equal to the observed one, 
indicating that interacting proteins exhibit rates of evolution 
that are more similar than expected from a random network 
(P= 0.0426; fig. 3). Random networks exhibit an average X 
value of 0.900, indicating that the co values for interacting 
genes are 0.8% more similar than random pairs of proteins. 

Ribosomal proteins might also represent a source of con- 
founding bias, as they are subject to strong levels of purifying 
selection and are highly connected to each other. However, 
significant results were obtained when ribosomal proteins 
were also removed from the network (X= 0.889, P= 0.039). 

Centrality and Duplicability in the A thaliana PIN 

Finally, we studied the effect of proteins' network position on 
the duplicability of their encoding genes. Among the 5,789 
genes encoding the A. thaliana PIN, 883 were found to be 
singleton, and 3,532 were deemed as duplicated based on 
similarity searches (the rest of the genes remained unclassified; 
see Materials and Methods). Among duplicated genes, 1,540 
are the result of one of the WGD events that took place in the 
Arabidopsis lineage (Blanc et al. 2003; De Bodt et al. 2005), 
and 1,992 were deemed as the result of SSD events. 
Unexpectedly, proteins encoded by duplicated genes have 
more interactors in the network (average degree of 5.18) 
than those encoded by singleton genes (average degree of 
4.10) (Mann-Whitney test, P= 0.008; fig. 2). These differ- 
ences remain significant when self-interactions and inter- 
actions among proteins encoded by paralogs are removed 
from the analysis (P= 0.049), or when genes are classified 
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Fig. 2. — Average number of interactors for proteins encoded by 
singleton, whole-genome duplication (WGD), and small-scale duplication 
(SSD) genes. Error bars represent the standard error of the means. 
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X 

Fig. 3. — Normalized absolute difference between the co values of 
pairs of interacting genes in the network (X). The arrow points to the 
observed value and the histogram represent the null distribution obtained 
from 1 0,000 random networks. 



as duplicated or singleton based on whether they present 
annotated paralogs in the Ensembl plants database release 
14 (Kersey et al. 2012) (P= 1 .62 x 10~ 4 ). Furthermore, 
degree positively correlates with the number of paralogs 
annotated in Ensembl plants (p = 0.089, P=1.02x 10~ 11 ), 
even when only duplicated genes are considered (p = 0.086, 
P=5.85 x 10~ 9 ). This observation is in agreement with the 
patterns observed in the human interactome (Liang and Li 
2007; D Antonio and Ciccarelli 2011; Doherty et al. 2012) 
but in contrast with those observed in E. coli, yeast, worm, 
and fly, whose duplicated genes tend to encode lowly con- 
nected proteins compared with singleton genes (Hughes and 
Friedman 2005; Prachumwat and Li 2006; Makino et al. 2009; 
D Antonio and Ciccarelli 2011). Among duplicated genes, 
those resulting from WGDs are more highly connected in 
the Arabidopsis network than those resulting from SSD 
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events (average degree of 5.70 and 4.78, respectively; Mann- 
Whitney test, P= 5.63 x 1CT 6 ; fig. 2). 

D'Antonio and Ciccarelli (201 1) found that the relationship 
between centrality and duplicability was different among an- 
cient and metazoan-specific human genes: among genes of 
premetazoan origin, duplicated genes are less central than 
singleton genes, whereas among genes that are specific to 
metazoans, duplicated genes tend to occupy more central 
positions than singleton genes. We considered whether the 
relationship between centrality and duplicability observed in 
Arabidopsis also depended on the age of the genes. For that 
purpose, genes were classified as "ancient" (if they had 
homologs in prokaryotes or in nonplant eukaryotes) or as 
"land plant-specific" (otherwise). A total of 3,114 network 
genes were classified as ancient, and the remaining ones 
were deemed as plant specific. We found that, among 
plant-specific proteins, those encoded by duplicated genes 
are more highly connected than those encoded by singleton 
genes (average degrees of 5.39 and 3.85, respectively; Mann- 
Whitney test, P=4.73x 10~ 6 ). This difference, however, is 
not significant among ancient genes (average degrees for 
proteins encoded by duplicated and singleton genes: 5.06 
and 4.41 interactions, respectively; P= 0.484). Therefore, 
the relationship between centrality and duplicability observed 
in Arabidopsis seems to be specific to plant-specific genes. 

Because duplicated Arabidopsis genes are more selectively 
constrained than singleton genes (Yang and Gaut 2011), 
the lower evolutionary rates of genes central to the A. thaliana 
PIN could potentially be a by-product of their enrichment 
in duplicated genes. However, the correlation between co 
and both betweenness (p = -0.050, P= 0.01 5) and closeness 
(p = -0.049, P= 0.01 6) remains significant when only dupli- 
cated genes are considered. Although these correlations 
are not significant for singleton genes (p = -0.044, 
P= 0.288 for betweenness; p = 0.043, P= 0.293 for close- 
ness), this might result from the small number of genes in 
this category (n = 883). 

Analysis of a High-Quality Subnetwork Provides 
Consistent Results 

Currently available interactomes are the result, to a high 
extent, of the application of high-throughput techniques of 
protein-protein interaction discovery. As a result, interactomic 
data sets are subject to high rates of false positives and nega- 
tives (Bader et al. 2004; Deeds et al. 2006). Given the potential 
that this could be affecting our observations, we repeated 
our analyses in a high-quality subset of the A. thaliana inter- 
actome, containing only reliable interactions. Similar 
to D'Antonio and Ciccarelli (2011), we filtered our data set, 
retaining only interactions that had been determined using 
low-throughput techniques (i.e., more accurate analysis of 
protein interactions on a one-by-one basis), and those that 
had been identified by two or more high-throughput analyses 



independently. The filtered network consisted of 4,808 inter- 
actions connecting 2,798 proteins, out of which 1,842 are 
encoded by genes with 1:1 orthologs in A. lyrata. This repre- 
sents a dramatic decrease in the amount of data when 
compared with the full data set (only 48.3% of the proteins 
and 33.5% of the interactions were present in the high- 
quality subnetwork), which potentially involves a decrease in 
statistical power but also an increase in the signal-to-noise 
ratio. 

Although the correlation between degree and co is not sig- 
nificant in this reduced data set (p = 0.003, P= 0.907), co 
values significantly correlate with both betweenness and 
closeness, with correlation coefficients that are higher in mag- 
nitude than those observed in the entire data set (between- 
ness: p = -0.061, P= 0.008; closeness: p = -0.108, 
P=3.22 x 10~ 6 ). The correlation between co and closeness 
remains significant when the effects of expression level and 
breadth, and protein length, are simultaneously controlled for 
(p = -0.061, P=0.013). Furthermore, genes encoding pro- 
teins that are represented in the network are more selectively 
constrained than those that are not represented in the net- 
work (median co values of 0.120 and 0.176, respectively; 
Mann-Whitney test, P< 10~ 15 ). 

Consistent with the observations in the full interactome, 
duplicated genes are more highly connected than singleton 
genes (average degrees of 3.58 and 2.83, respectively) in the 
high-quality subnetwork. Although the degrees of both 
groups are not significantly different (Mann-Whitney test, 
P=0.132), the duplicated/singleton degree ratio is equivalent 
to that observed in the full data set (1 .27), suggesting that the 
lack of significance in the analysis of the high-quality subnet- 
work may result from the reduced statistical power resulting 
from trimming the data set. 

Discussion 

Results presented here provide multiple evidences linking the 
position of proteins in the A. thaliana PIN and evolutionary 
forces acting on their encoding genes. Remarkably, genes 
acting at the most central positions of the network (measured 
as degree, betweenness, or closeness) are subject to stronger 
levels of purifying selection than those acting at the network 
periphery. The trend is independent of the distribution across 
the network of potential confounding factors (patterns and 
levels of gene expression, gene duplicability, function, and 
protein length), suggesting that network position has a 
direct effect on levels of selective constraint. The observation 
that proteins with more interacting partners are subject to 
stronger levels of selective constraint suggests that direct pro- 
tein-protein interactions impose constraints on protein se- 
quence evolution. Nonetheless, closeness, and in particular 
betweenness, seem to be better predictors of evolutionary 
rates than degree (fig. 1 and table 1). These are global meas- 
ures of network centrality that take into account not only 
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the immediate network context of a protein (as degree does) 
but rather its position in relation to the entire network. 
Therefore, the evolutionary rate of a given protein does not 
only depend on the number of direct interactors but also on its 
broader network context. In particular, betweenness is a 
measure of how the flow of information through the network 
depends on each individual protein (Jeong et al. 2000; 
Wagner and Fell 2001). Proteins bridging the gap between 
parts of the network tend to exhibit a high betweenness 
(Ravasz et al. 2002). Therefore, our observations suggest 
that proteins exerting a high degree of control on information 
flow across the network are particularly relevant for the or- 
ganism's fitness. These results are in agreement with previous 
observations in organisms for which dense PINs are readily 
available (£ coli, yeast, worm, fly, and human; Fraser et al. 
2002; Teichmann 2002; Jordan et al. 2003; Hahn and Kern 
2005; Lemos et al. 2005; Davids and Zhang 2008; 
Alvarez-Ponce 2012), which show that genes acting at the 
center of the network, or those participating in protein com- 
plexes, are more selectively constrained, thereby suggesting 
a general trend across all of life. Furthermore, our results are in 
agreement with previous observations in yeast, worm, and fly 
that betweenness is a better predictor of rates of evolution 
than degree (Hahn and Kern 2005) — a pattern that has been 
also observed in the yeast metabolic network (Lu et al. 
2007) — suggesting a general trend at least in Eukaryotes. 

Further evidence for the relationship between position of 
proteins in the A. thaliana PIN and rates of evolution is pro- 
vided by the observation that genes encoding interacting pro- 
teins tend to be subject to similar levels of selective constraint 
(fig. 3), consistent with previous observations in other inter- 
actomes (Pazos and Valencia 2001; Fraser et al. 2002; Lemos 
et al. 2005; Alvarez-Ponce et al. 2009, 201 1 ; Cui et al. 2009). 
This similarity might in part result from the mutational com- 
pensatory dynamics between amino acids involved in protein- 
protein interactions (Codoher and Fares 2008; Fares et al. 
2011). It should be noted, however, that covariation in the 
evolutionary rates of proteins can obey to other factors such as 
these proteins performing shared biological functions or ex- 
hibiting correlated patterns of expression (Clark et al. 2012). 
These alternative explanations — covariation due to mutational 
compensation, shared function, or coexpression — are not mu- 
tually exclusive. 

Although our analyses reveal a clear association between 
network position and evolutionary rates, the trend is generally 
weak in comparison with other correlates of rates of evolution 
such as gene expression (Duret and Mouchiroud 2000; Pal 
et al. 2001; Krylov et al. 2003; Subramanian and Kumar 
2004; Wright et al. 2004; Drummond et al. 2005, 2006; 
Ingvarsson 2007; Slotte et al. 201 1; Yang and Gaut 201 1 ) — 
compare, for instance, the co-degree correlation (p = -0.030) 
with the co-expression level (p = -0.331) and the co-expression 
breadth (p = -0.245) correlations (table 1). This is in good 
agreement with previous analyses linking network position 



and rates of evolution, which often recover weak effects 
(for review, see Cork and Purugganan 2004; Rocha 2006). 
The weakness of these effects may be the result of the rela- 
tively low fraction of amino acids participating in protein-pro- 
tein interactions. It is remarkable, however, that the strength 
of the correlation between degree and rates of evolution is 
highly variable across the different functional categories. 
Indeed, for some categories, the correlation coefficients of 
the co-degree correlation attain values that are comparable 
with (or even surpass) those for the co-expression level and/ 
or the co-expression breadth correlations (table 2). This sug- 
gests that for certain functional categories, protein-protein 
interactions strongly constrain protein evolution. The 
co-degree correlation is significantly higher for proteins in func- 
tional categories "signal transduction mechanisms" 
(P= 0.009) and "chromatin structure and dynamics" 
(P= 0.025) than for the rest of the network (supplementary 
table S2, Supplementary Material online). Consistently, it has 
been recently observed that degree is a strong correlate of 
rates of evolution among genes involved in the human signal 
transduction network (Alvarez-Ponce 2012). 

Despite the weak effect of network position on rates of 
evolution, a much stronger effect is observed on gene duplic- 
ability. On average, proteins encoded by duplicated genes 
exhibit 26%-27% more interactors than those encoded by 
singleton genes (fig. 2). This higher connectivity of duplicated 
genes has been also observed in humans (Liang and Li 2007; 
D'Antonio and Ciccarelli 2011; Doherty et al. 2012), but the 
opposite pattern was observed in £ coli, yeast, worm, 
and fly — in which genes acting at the periphery of the net- 
work are the ones that tend to undergo duplication (Hughes 
and Friedman 2005; Prachumwat and Li 2006; Makino et al. 
2009; D'Antonio and Ciccarelli 201 1). 

The differential pattern observed in the human interactome 
has been shown to be the result of the high duplicability 
of hubs originated after the emergence of Metazoans 
(D'Antonio and Ciccarelli 2011). Human ancient genes 
(those of premetazoan origin), however, exhibit the same ten- 
dency as observed in £ coli, yeast, worm, and fly: duplications 
tend to occur in genes acting at the periphery of the network 
(D'Antonio and Ciccarelli 201 1). This indicates that the par- 
ticular trend observed in the human interactome is a derived 
character and hence that the relationship between centrality 
and duplicability shifted in the vertebrate lineage. Our obser- 
vations that, also among A. thaliana novel (i.e., plant specific) 
genes, duplicated genes are more connected than singleton 
genes indicate that the relationship between duplicability 
and centrality shifted not only in vertebrates but also in the 
Arabidopsis lineage. 

The higher duplicability of proteins acting at the periphery 
of the £ coli, yeast, worm, and fly PINs has been attributed to 
the fragility of the network to duplication of its more con- 
nected elements. Indeed, the duplication of a given gene is 
expected to disrupt the dosage balance of the interactions in 



Genome Biol. Evol. 4(1 2): 1263-1 274. doi:10.1093/gbe/evs101 Advance Access publication November 18, 2012 



1271 



Alvarez-Ponce and Fares 



GBE 



which it is involved (Birchler et al. 2001; Veitia 2002; Papp 
et al. 2003), which may have more deleterious effects for 
genes with a higher number of interactors. On the other 
hand, highly connected proteins may play a key role in main- 
taining the robustness of the network, thereby making 
networks fragile to mutation or loss of these genes (Albert 
et al. 2000; Jeong et al. 2001). Hence, duplication of highly 
connected proteins might be favored, as they increase the 
robustness of the system (Ekman et al. 2006). These and 
other competing forces probably act with different strengths 
in different organisms, resulting in the contrasting overall 
trends observed in available interactomes. 

Major evolutionary transitions (from prokaryotes to unicel- 
lular eukaryotes, to multicellular eukaryotes, and to land 
plants and vertebrates) were accompanied by reductions of 
orders of magnitude in effective population sizes. As a result, 
vertebrates and land plants exhibit effective population sizes 
that are smaller than those for E. coli, yeast, worm, and fly 
(Lynch 2007). In populations with a small effective size, natural 
selection is less efficient, and hence, the fate of mutations is 
largely determined by random genetic drift. In such popula- 
tions, natural selection may have small power in removing 
duplicates of genes involved in a large number of interactions, 
in spite of the fact that duplication of such genes are expected 
to be deleterious according to the dosage balance hypothesis 
(Birchler et al. 2001; Veitia 2002; Papp et al. 2003). This can 
result in a (partial) suppression of the negative relationship 
between duplicability and centrality predicted by the dosage 
balance hypothesis. It is possible that, under such conditions, 
factors promoting a positive duplicability— centrality relation- 
ship can manifest, thereby resulting in the patterns observed 
in the human (Liang and Li 2007; D'Antonio and Ciccarelli 
2011; Doherty et al. 2012) and Arabidopsis (current work) 
interactomes. The future availability of the interactomes of 
a broader range of organisms, with different effective popu- 
lation sizes, may enable a better understanding of the effect 
of this factor on the relationship between centrality and 
duplicability. 

Among A. thaliana duplicated genes, those resulting from 
the WGD events that took place in this lineage (Blanc et al. 
2003; De Bodt et al. 2005) are more highly connected in the 
PIN than those resulting from SSD events (fig. 2). This result is 
in concert with the dosage-balance hypothesis, which predicts 
that duplication of a gene with interactors may be deleterious 
unless its interacting partners simultaneously coduplicate 
(Birchler et al. 2001; Veitia 2002; Papp et al. 2003). Under 
this scenario, highly connected genes are more likely to retain 
duplicated copies if they are the result of WGDs, as simultan- 
eous duplication of the entire genome maintains the stoichi- 
ometry of all balanced sets (Veitia 2004, 2005). 

The current analysis represents, to our knowledge, the first 
interactome-level evaluation of the relationship between 
the structure of the A. thaliana PIN and the patterns of mo- 
lecular evolution of its components. Taken together, results 



presented here indicate that the network imposes constraints 
on the patterns of molecular evolution of its components at 
multiple levels, from sequence evolution to gene duplication. 
Therefore, genes do not evolve independently but as pieces of 
a more complex system. Currently, the availability of protein- 
protein interaction data for A. thaliana is limited in comparison 
with other model organisms (Stark et al. 201 1). As more data 
become available, the emergence of a more detailed map of 
the A. thaliana interactome will enable a more detailed under- 
standing of its evolution. 

Conclusions 

Results presented here provide multiple evidences linking 
the position of proteins in the A. thaliana PIN and evolutionary 
forces acting on their encoding genes. In agreement with 
previous observations in other organisms, genes acting at 
the most central positions of the network are subject to 
increased levels of selective constraint, and genes encoding 
interacting proteins exhibit correlated rates of evolution. 
Duplicated genes tend to be more central than singleton 
genes, in agreement with the patterns observed in humans, 
but opposite to those observed in E. coli, yeast, worm, 
and fly, thereby indicating that the relationship between cen- 
trality and duplicability underwent modification not only in 
vertebrates but also in plants. Taken together, these results 
indicate that the A. thaliana PIN impose constraints in the 
patterns of molecular evolution of its encoding genes at 
multiple levels. 

Supplementary Material 

Supplementary tables S1 and S2 are available at Genome 
Biology and Evolution online (http://www.gbe.oxfordjournals. 
org/). 
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